Abstract

Autoencoders are unsupervised machine learning circuits whose learning goal 
is to minimize a distortion measure between inputs and outputs. Linear 
autoencoders can be defined over any field, yet only real-valued linear
autoencoders have been studied so far. Here we study complex-valued linear
autoencoders where the components of the training vectors and adjustable
matrices are defined over the complex field with the $L_2$ norm. We provide
simpler and more general proofs that unify the real-valued and complex- 
valued cases, showing that in both cases the landscape of the error function 
is invariant under certain groups of transformations. The landscape has no 
local minima, a family of global minima associated with Principal Compo- 
nent Analysis, and many families of saddle points associated with orthogonal 
projections onto subspaces spanned by sub-optimal subsets of eigenvectors of
the covariance matrix. The theory yields several iterative, convergent, learn- 
ing algorithms, a clear understanding of the generalization properties of the 
trained autoencoders, and can equally be applied to the hetero-associative 
case when external targets are provided. Partial results on deep architectures
as well as the differential geometry of autoencoders are also presented. The 
general framework described here is useful to classify autoencoders and iden- 
tify general common properties that ought to be investigated for each class, 
illuminating some of the connections between information theory, unsuper- 
vised learning, clustering, Hebbian learning, and autoencoders. 
1. Introduction



Autoencoder circuits, which try to minimize a distortion measure be- 
tween inputs and outputs, play a fundamental role in machine learning. 
They were introduced in the 1980s by the Parallel Distributed Processing 
(PDP) group 2jj as a way to address the problem of unsupervised learn- 
ing, in contrast to supervised learning in backpropagation networks. More 
recently, autoencoders have been used extensively in the "deep architecture" 
approach ll|, Jj], [5], [10], where autoencoders in the form of Restricted
Boltzmann Machines (RBMs) are stacked and trained bottom up in unsupervised
fashion to extract hidden features and efficient representations that can then 
be used to address supervised classification or regression tasks. In spite of the 
interest they have generated, and with a few exceptions [20], little theoreti- 
cal understanding of autoencoders and deep architectures has been obtained 
to date. The main purpose of this article is to provide a complete theory 
for a particular class of autoencoders, namely linear autoencoders over the 
complex field. 

In addition to trying to progressively derive a more complete theoretical 
understanding of autoencoders, there are several other specific reasons for 
looking at linear complex-valued autoencoders. First, linear autoencoders
over the real numbers were solved by Baldi and Hornik 0] (see also jsj). It is 
thus natural to ask whether linear autoencoders over the complex numbers 
share the same basic properties or not. More generally linear autoencoders 
can be defined over any field and therefore one can raise similar questions
for linear autoencoders over other fields, such as finite Galois fields [14].
Second, there has been a general recent trend towards using linear networks 
to address difficult tasks in clever ways by introducing particular restrictions 
such as sparsity or low rank jsl, 0] . Autoencoders discussed in this paper can 
be viewed as linear low-rank approximations to the identity function. Third, 
complex vector spaces and matrices have several areas of specific application, 
ranging from quantum mechanics to fast Fourier transforms, and ought to 
be studied in their own right. The same can be said of complex-valued
autoencoders and, more generally, complex-valued neural networks [13].



Here we provide a complete treatment of linear complex-valued autoencoders.
We first introduce a general framework and notation that are essential
for a deeper understanding of autoencoders, in particular to enable the
identification of common properties that ought to be studied in any specific 
autoencoder case. While in the end the results obtained in the complex-valued
case are similar to those previously obtained in the real-valued case
jij, interchanging conjugate transposition with simple transposition, the
approach adopted here allows us to derive simpler and more general proofs that
unify both cases. We also investigate in more detail several properties and 
derive several novel results addressing, for instance, learning algorithms and 
their convergence properties. Finally, in the Appendix, we begin the study of 
real- and complex-valued autoencoders from a differential geometry perspective.



2. General Autoencoder Framework and Preliminaries 

2.1. General Autoencoder Framework 
Figure 1: An n/p/n Autoencoder Architecture.

To derive a fairly general framework, an n/p/n autoencoder is defined by
a tuple $(F, G, n, p, \mathcal{A}, \mathcal{B}, \mathcal{X}, \Delta)$ where:

1. $F$ and $G$ are sets.

2. $n$ and $p$ are positive integers. Here we consider primarily the case where
$0 < p < n$.

3. $\mathcal{A}$ is a class of functions from $G^p$ to $F^n$.

4. $\mathcal{B}$ is a class of functions from $F^n$ to $G^p$.

5. $\mathcal{X} = \{x_1, \ldots, x_m\}$ is a set of $m$ (training) vectors in $F^n$. When external
targets are present, we let $\mathcal{Y} = \{y_1, \ldots, y_m\}$ denote the corresponding
set of target vectors in $F^n$.

6. $\Delta$ is a dissimilarity or distortion function defined over $F^n$.

For any $A \in \mathcal{A}$ and $B \in \mathcal{B}$, the autoencoder transforms an input vector
$x \in F^n$ into an output vector $A \circ B(x) \in F^n$ (Figure 1). The corresponding
autoencoder problem is to find $A \in \mathcal{A}$ and $B \in \mathcal{B}$ that minimize the overall
distortion function:

$$\min_{A,B} E(A, B) = \min_{A,B} \sum_{t=1}^{m} E(x_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t), x_t\big) \tag{1}$$

In the non auto-associative case, when external targets $y_t$ are provided, the
minimization problem becomes:

$$\min_{A,B} E(A, B) = \min_{A,B} \sum_{t=1}^{m} E(x_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t), y_t\big) \tag{2}$$

Note that p < n corresponds to the regime where the autoencoder tries 
to implement some form of compression or feature extraction. The case 
p > n is not treated here but can be interesting in situations which either 
(1) prevent the use of trivial solutions by enforcing additional constraints, 
such as sparsity, or (2) include noise in the hidden layer, corresponding to 
transmission over a noisy channel. 

From this general framework, different kinds of autoencoders
can be derived depending, for instance, on the choice of sets $F$ and $G$,
transformation classes $\mathcal{A}$ and $\mathcal{B}$, distortion function $\Delta$, as well as the presence of
additional constraints. Linear autoencoders correspond to the case where $F$
and $G$ are fields and $\mathcal{A}$ and $\mathcal{B}$ are the classes of linear transformations, hence
$A$ and $B$ are matrices of size $n \times p$ and $p \times n$ respectively. The linear real case
where $F = G = \mathbb{R}$ and $\Delta$ is the squared Euclidean distance was addressed in
fl (see also 0).

2.2. Complex Linear Autoencoder 

Here we consider the corresponding complex linear case where $F = G = \mathbb{C}$
and the goal is the minimization of the squared Euclidean distance

$$\min_{A,B} E(A, B) = \min_{A,B} \sum_{t=1}^{m} \|x_t - AB(x_t)\|^2 = \min_{A,B} \sum_{t=1}^{m} (x_t - AB(x_t))^* (x_t - AB(x_t)) \tag{3}$$

Unless otherwise specified, all vectors are column vectors and we use x* (resp. 
X*) to denote the conjugate transpose of a vector x (resp. of a matrix X). 
Note that the same notation works for both the complex and real case. As 
we shall see, in the linear complex case as in the linear real case, one can also 
address the case where external targets are available, in which case the goal 
is the minimization of the distance 



$$\min_{A,B} E(A, B) = \min_{A,B} \sum_{t=1}^{m} \|y_t - AB(x_t)\|^2 = \min_{A,B} \sum_{t=1}^{m} (y_t - AB(x_t))^* (y_t - AB(x_t)) \tag{4}$$

In practical applications, it is often preferable to work with centered data,
after subtraction of the mean. The centered and non-centered versions of
the problem are two different problems with, in general, two different solutions.
The general equations to be derived apply equally to both cases.
In general, we define the covariance matrices as follows

$$\Sigma_{XY} = \sum_t x_t y_t^* \tag{5}$$

Using this definition, $\Sigma_{XX}$ and $\Sigma_{YY}$ are Hermitian matrices, $(\Sigma_{XX})^* = \Sigma_{XX}$ and
$(\Sigma_{YY})^* = \Sigma_{YY}$, and $(\Sigma_{XY})^* = \Sigma_{YX}$. We let also

$$\Sigma = \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \tag{6}$$

$\Sigma$ is also Hermitian. In the auto-associative case, $x_t = y_t$ for all $t$, resulting
in $\Sigma = \Sigma_{XX}$. Note that any Hermitian matrix admits a set of orthonormal
eigenvectors and all its eigenvalues are real. Finally, we let $I_m$ denote the
$m \times m$ identity matrix.
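As a quick numerical illustration of these definitions, the following sketch builds the covariance matrices from random complex data and checks the Hermitian properties stated above. The sizes, the random data, and the helper names `cov` and `is_hermitian` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 50  # illustrative sizes, not taken from the text

# Columns are the training vectors x_t and the targets y_t.
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
Y = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

def cov(U, V):
    """Return sum_t u_t v_t^* for sample matrices with one column per t."""
    return U @ V.conj().T

S_xx, S_xy, S_yx, S_yy = cov(X, X), cov(X, Y), cov(Y, X), cov(Y, Y)

# Sigma = S_yx S_xx^{-1} S_xy; solve() avoids forming the inverse explicitly.
Sigma = S_yx @ np.linalg.solve(S_xx, S_xy)

def is_hermitian(M):
    """True when M equals its conjugate transpose up to rounding."""
    return np.allclose(M, M.conj().T)

# In the auto-associative case (Y = X), Sigma collapses to S_xx.
Sigma_auto = S_xx @ np.linalg.solve(S_xx, S_xx)
```

Storing one sample per column makes each covariance a single matrix product, which is convenient for the later sketches.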

2.3. Useful Reminders 

Standard Linear Regression. Consider the standard linear regression
problem of minimizing $E(B) = \sum_t \|y_t - Bx_t\|^2$, where $B$ is a $p \times n$ matrix,
corresponding to a linear neural network without any hidden layers. Then
we can write

$$E(B) = \sum_t x_t^* B^* B x_t - 2\,\mathrm{Re}(y_t^* B x_t) + \|y_t\|^2 \tag{7}$$

Thus $E$ is a convex function in $B$ because the associated quadratic form is
equal to

$$\sum_t x_t^* C^* C x_t = \sum_t \|Cx_t\|^2 \geq 0 \tag{8}$$

Let $B$ be a critical point. Then by definition for any $p \times n$ matrix $C$ we
must have $\lim_{\epsilon \to 0} [E(B + \epsilon C) - E(B)]/\epsilon = 0$. Expanding and simplifying
this expression gives

$$\sum_t x_t^* B^* C x_t - y_t^* C x_t = 0 \tag{9}$$

for all $p \times n$ matrices $C$. Using the linearity of the trace operator and its
invariance under circular permutation of its arguments (footnote 1), this is equivalent to

$$\mathrm{Tr}\big((\Sigma_{XX} B^* - \Sigma_{XY}) C\big) = 0 \tag{10}$$

for any $C$. Thus we have $\Sigma_{XX} B^* - \Sigma_{XY} = 0$ and therefore

$$B \Sigma_{XX} = \Sigma_{YX} \tag{11}$$

If $\Sigma_{XX}$ is invertible, then $Cx_t = 0$ for any $t$ is equivalent to $C = 0$, and thus
the function $E(B)$ is strictly convex in $B$. The unique critical point is the
global minimum given by $B = \Sigma_{YX} \Sigma_{XX}^{-1}$. As we shall see, the solution to the
standard linear regression problem, together with the general approach given
here to solve it, is also key for solving the more general linear autoencoder
problem. The solution will also involve projection matrices.
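The closed-form regression solution can be checked numerically. The sketch below, on random complex data with assumed sizes, compares $B = \Sigma_{YX} \Sigma_{XX}^{-1}$ with a generic least-squares solver and verifies that a small perturbation raises the error, as strict convexity predicts. The variable names and the perturbation size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 40  # illustrative sizes; targets have the same dimension as inputs here
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
Y = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

S_xx = X @ X.conj().T
S_yx = Y @ X.conj().T

def E(B):
    """Squared error sum_t ||y_t - B x_t||^2."""
    return float(np.sum(np.abs(Y - B @ X) ** 2))

# B = S_yx S_xx^{-1}. Since S_xx is Hermitian, solve S_xx Z = S_yx^* and
# take the conjugate transpose of Z.
B_opt = np.linalg.solve(S_xx, S_yx.conj().T).conj().T

# Cross-check with a generic least-squares solver applied to X^T B^T = Y^T.
B_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

# A small random perturbation should increase the error.
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
E_opt, E_perturbed = E(B_opt), E(B_opt + 1e-3 * C)
```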
Projection Matrices. For any $n \times k$ matrix $A$ with $k < n$, let $P_A$ denote
the orthogonal projection onto the subspace generated by the columns of $A$.
Then $P_A$ is a Hermitian symmetric matrix and $P_A^2 = P_A$, $P_A A = A$ since the
image of $P_A$ is spanned by the columns of $A$ and these are invariant under
$P_A$. The kernel of $P_A$ is the space $A^\perp$ orthogonal to the space spanned by the
columns of $A$ with $P_A A^\perp = 0$ and $A^* P_A = A^*$. The projection onto the space
orthogonal to the space spanned by the columns of $A$ is given by $I_n - P_A$.
In addition, if the columns of $A$ are independent (i.e. $A$ has full rank $k$),
then the matrix of the orthogonal projection is given by $P_A = A(A^*A)^{-1}A^*$
[16] and $P_A^* = P_A$. Note that all these relationships are true even when the
columns of $A$ are not orthonormal.

Footnote 1: It is easy to show directly that for any matrices $A$ and $B$ of the proper size,
$\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ [13]. Therefore for any matrices $A$, $B$, and $C$ of the proper size,
we have $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$.
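The projection identities above are easy to verify numerically. The following sketch uses a random complex matrix with non-orthonormal columns; the helper name `projector` and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2  # illustrative sizes
A = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))

def projector(A):
    """Return P_A = A (A^* A)^{-1} A^* for a matrix A of full column rank."""
    Ah = A.conj().T
    return A @ np.linalg.solve(Ah @ A, Ah)

P = projector(A)

# A vector in the orthogonal complement: remove the projected part of w.
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v_perp = w - P @ w
```

Note that the columns of `A` are not orthonormal here, which matches the remark that the relationships do not require orthonormality.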



2.4. Some Misconceptions

As we shall see, in the complex case as in the real case, the global mini- 
mum corresponds to Principal Component Analysis. While the global min- 
imum solution of linear autoencoders over infinite fields can be expressed 
analytically, it is often not well appreciated that there is more to be un- 
derstood about linear autoencoders and the landscape of E. In particular, 
if one is interested in learning algorithms that proceed through incremental 
and somewhat "blind" weight adjustments, then one must study the entire 
landscape of E, including all the critical points of E, and derive and com- 
pare different learning algorithms. A second misconception is to believe that 
the problem is a convex optimization problem, hence somewhat trivial, since 
after all the error function is quadratic and the transformation W = AB 
is linear. The problem with this argument is that the small layer of size p 
forces W to be of rank p or less, and the set of matrices of rank at most
p is not convex. Furthermore, the problem is not convex when finite fields 
are considered. What is true and crucial for solving the linear autoencoders 
over infinite fields is that the problem becomes convex when A or B is fixed. 
A third misconception, related to the illusion of convexity, is that the $L_2$
landscape of linear neural networks never has any local minima. In general 
this is not true, especially if there are additional constraints on the linear 
transformation, such as restricted connectivity between layers so that some 
of the matrix entries are constrained to assume fixed values. 
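The non-convexity of the set of matrices of rank at most p can be seen on a tiny example: the midpoint of two rank-1 matrices can have rank 2. The matrices below are hypothetical choices made only to illustrate this point.

```python
import numpy as np

W1 = np.diag([1.0, 0.0, 0.0])  # rank 1
W2 = np.diag([0.0, 1.0, 0.0])  # rank 1
W_mid = 0.5 * (W1 + W2)        # midpoint of the segment joining W1 and W2

# The midpoint leaves the set of matrices of rank at most 1.
ranks = [int(np.linalg.matrix_rank(W)) for W in (W1, W2, W_mid)]
```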



3. Group Invariances 

For any autoencoder, it is important to investigate whether there are any
groups of transformations that leave its properties essentially invariant.
Change of Coordinates in the Hidden Layer. Note that for any
invertible $p \times p$ complex matrix $C$, we have $W = AB = ACC^{-1}B$ and
$E(A, B) = E(AC, C^{-1}B)$. Thus all the properties of the linear autoencoder
are fundamentally invariant with respect to any change of coordinates in the
hidden layer.

Change of Coordinates in the Input/Output Spaces. Consider an
orthonormal change of coordinates in the output space defined by an orthogonal
(or unitary) $n \times n$ matrix $D$, and any change of coordinates in the input
space defined by an invertible $n \times n$ matrix $C$. This leads to a new autoencoder
problem with input vectors $Cx_1, \ldots, Cx_m$ and target output vectors of
the form $Dy_1, \ldots, Dy_m$ with reconstruction error of the form

$$E(A', B') = \sum_t \|Dy_t - A'B'Cx_t\|^2 \tag{12}$$

If we use the one to one mapping between the pairs of matrices $(A, B)$ and
$(A', B')$ defined by $A' = DA$ and $B' = BC^{-1}$, we have

$$E(A', B') = \sum_t \|Dy_t - A'B'Cx_t\|^2 = \sum_t \|Dy_t - DABx_t\|^2 = \sum_t \|y_t - ABx_t\|^2 \tag{13}$$

the last equality using the fact that $D$ is an isometry and preserves distances
and angles. Thus, using the transformation $A' = DA$ and $B' = BC^{-1}$, the original
problem and the transformed problem are equivalent and the functions
$E(A, B)$ and $E(A', B')$ have the same landscape. In particular, in the
auto-associative case, we can take $C = D$ to be a unitary matrix. This leads to
an equivalent autoencoder problem with input vectors $Cx_t$ and covariance
matrix $C \Sigma C^{-1}$. For the proper choice of $C$ there is an equivalent problem
where the basis of the space is provided by the eigenvectors of the covariance
matrix and the covariance matrix is a diagonal matrix with diagonal entries
equal to the eigenvalues of the original covariance matrix $\Sigma$.
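Both invariances can be checked numerically. In the sketch below the data, sizes, and helper names are assumptions; the input/output check takes the decoder as $DA$ and the encoder as $B$ times the inverse of the input change of coordinates, so that the transformed encoder applied to the transformed inputs reproduces $Bx_t$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 4, 2, 30  # illustrative sizes

def crandn(*shape):
    """Random complex Gaussian array (hypothetical helper)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X, Y = crandn(n, m), crandn(n, m)
A, B = crandn(n, p), crandn(p, n)

def E(A, B, X, Y):
    """Reconstruction error sum_t ||y_t - A B x_t||^2."""
    return float(np.sum(np.abs(Y - A @ B @ X) ** 2))

E0 = E(A, B, X, Y)

# Hidden layer: (A, B) -> (A C, C^{-1} B) for an invertible p x p matrix C.
C_hid = crandn(p, p)
E_hidden = E(A @ C_hid, np.linalg.solve(C_hid, B), X, Y)

# Input/output spaces: unitary D on the outputs, invertible C_in on the inputs.
D, _ = np.linalg.qr(crandn(n, n))  # Q factor of a random matrix is unitary
C_in = crandn(n, n)                # invertible with probability one
E_io = E(D @ A, B @ np.linalg.inv(C_in), C_in @ X, D @ Y)
```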

4. Fixed-Layer and Convexity Results 

A key technique for studying any autoencoder is to simplify the problem
by fixing all its transformations but one. Thus in this section we study what 
happens to the complex-valued linear autoencoder problem when either A 
or B is fixed, essentially reducing the problem to standard linear regression. 
The same approach can be applied to an autoencoder with more than one 
hidden layer (see section on Deep Architectures). 
Theorem 1. (Fixed A) For any fixed $n \times p$ matrix $A$, the function $E(A, B)$
is convex in the coefficients of $B$ and attains its minimum for any $B$ satisfying
the equation

$$A^* A B \Sigma_{XX} = A^* \Sigma_{YX} \tag{14}$$

If $\Sigma_{XX}$ is invertible and $A$ is of full rank $p$, then $E$ is strictly convex and has
a unique minimum reached when

$$B = (A^* A)^{-1} A^* \Sigma_{YX} \Sigma_{XX}^{-1} \tag{15}$$

In the auto-associative case, if $\Sigma_{XX}$ is invertible and $A$ is of full rank $p$, then
the optimal $B$ has full rank $p$ and does not depend on the data. It is given by

$$B = (A^* A)^{-1} A^* \tag{16}$$

and in this case, $W = AB = A(A^*A)^{-1}A^* = P_A$ and $BA = I_p$.

Proof. We write

$$E(A, B) = \sum_t x_t^* B^* A^* A B x_t - 2\,\mathrm{Re}(y_t^* A B x_t) + \|y_t\|^2 \tag{17}$$

Then for fixed $A$, $E$ is a convex function because the associated quadratic
form is equal to

$$\sum_t x_t^* C^* A^* A C x_t = \sum_t \|ACx_t\|^2 \geq 0 \tag{18}$$

for any $p \times n$ matrix $C$. Let $B$ be a critical point. Then by definition for
any $p \times n$ matrix $C$ we must have $\lim_{\epsilon \to 0} [E(A, B + \epsilon C) - E(A, B)]/\epsilon = 0$.
Expanding and simplifying this expression gives

$$\sum_t x_t^* B^* A^* A C x_t - y_t^* A C x_t = 0 \tag{19}$$

for all $p \times n$ matrices $C$. Using the linearity of the trace operator and its
invariance under circular permutation of its arguments, this is equivalent to

$$\mathrm{Tr}\big((\Sigma_{XX} B^* A^* A - \Sigma_{XY} A) C\big) = 0 \tag{20}$$

for any $C$. Thus we have $\Sigma_{XX} B^* A^* A - \Sigma_{XY} A = 0$ and therefore

$$A^* A B \Sigma_{XX} = A^* \Sigma_{YX} \tag{21}$$

Finally, if $\Sigma_{XX}$ is invertible and if $A$ is of full rank, then $ACx_t = 0$ for any
$t$ is equivalent to $C = 0$, and thus the function $E(A, B)$ is strictly convex in
$B$. Since $A^* A$ is invertible, the unique critical point is obtained by solving
Equation (14).
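A minimal numerical sketch of Theorem 1 in the auto-associative case follows. It computes the optimal $B$ for a random fixed $A$ and checks that $BA = I_p$, that $W = AB$ is an orthogonal projection, and that perturbing $B$ increases the error. The data, sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 5, 2, 60  # illustrative sizes
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))

def E(A, B):
    """Auto-associative error sum_t ||x_t - A B x_t||^2."""
    return float(np.sum(np.abs(X - A @ B @ X) ** 2))

# Optimal B for fixed A: B = (A^* A)^{-1} A^*, independent of the data.
Ah = A.conj().T
B_opt = np.linalg.solve(Ah @ A, Ah)
W = A @ B_opt

# Strict convexity in B: a small perturbation should raise the error.
C = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
E_opt, E_perturbed = E(A, B_opt), E(A, B_opt + 1e-3 * C)
```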

In similar fashion, we have the following theorem. 

Theorem 2. (Fixed B) For any fixed $p \times n$ matrix $B$, the function $E(A, B)$
is convex in the coefficients of $A$ and attains its minimum for any $A$ satisfying
the equation

$$A B \Sigma_{XX} B^* = \Sigma_{YX} B^* \tag{22}$$

If $\Sigma_{XX}$ is invertible and $B$ is of full rank, then $E$ is strictly convex and has
a unique minimum reached when

$$A = \Sigma_{YX} B^* (B \Sigma_{XX} B^*)^{-1} \tag{23}$$

In the auto-associative case, if $\Sigma_{XX}$ is invertible and $B$ is of full rank, then
the optimal $A$ has full rank $p$ and depends on the data. It is given by

$$A = \Sigma_{XX} B^* (B \Sigma_{XX} B^*)^{-1} \tag{24}$$

and $BA = I_p$.

Proof. From Equation (17) the function $E(A, B)$ is a convex function in $A$.
The condition for $A$ to be a critical point is

$$\sum_t x_t^* B^* A^* C B x_t - y_t^* C B x_t = 0 \tag{25}$$

for any $n \times p$ matrix $C$, which is equivalent to

$$\mathrm{Tr}\big((B \Sigma_{XX} B^* A^* - B \Sigma_{XY}) C\big) = 0 \tag{26}$$

for any matrix $C$. Thus $B \Sigma_{XX} B^* A^* - B \Sigma_{XY} = 0$, which implies Equation (22).
The other assertions of the theorem can easily be deduced.
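The companion check for Theorem 2 in the auto-associative case is sketched below: for a random fixed $B$, the optimal $A$ is computed from the data covariance and tested against the critical-point equation, the identity $BA = I_p$, and a perturbation. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m = 5, 2, 60  # illustrative sizes
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
B = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))

S_xx = X @ X.conj().T
Bh = B.conj().T

def E(A, B):
    """Auto-associative error sum_t ||x_t - A B x_t||^2."""
    return float(np.sum(np.abs(X - A @ B @ X) ** 2))

# Optimal A for fixed B: A = S_xx B^* (B S_xx B^*)^{-1}; it depends on the data.
A_opt = S_xx @ Bh @ np.linalg.inv(B @ S_xx @ Bh)

# Strict convexity in A: a small perturbation should raise the error.
C = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
E_opt, E_perturbed = E(A_opt, B), E(A_opt + 1e-3 * C, B)
```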

Remark 1. Note that from Theorems 1 and 2 and their proofs, we have that
$(A, B)$ is a critical point of $E(A, B)$ if and only if Equation (14) and Equation
(22) are simultaneously satisfied, that is if and only if $A^* A B \Sigma_{XX} = A^* \Sigma_{YX}$
and $A B \Sigma_{XX} B^* = \Sigma_{YX} B^*$.

5. Critical Points and the Landscape of E 

In this section we further study the landscape of E, its critical points, 
and the properties of W = AB at those critical points. 

Theorem 3. (Critical Points) Assume that $\Sigma_{XX}$ is invertible. Then two
matrices $(A, B)$ define a critical point of $E$ if and only if the global map
$W = AB$ is of the form

$$W = P_A \Sigma_{YX} \Sigma_{XX}^{-1} \tag{27}$$

with $A$ satisfying

$$P_A \Sigma = P_A \Sigma P_A = \Sigma P_A \tag{28}$$

In the auto-associative case, the above becomes

$$W = AB = P_A \tag{29}$$

and

$$P_A \Sigma_{XX} = P_A \Sigma_{XX} P_A = \Sigma_{XX} P_A \tag{30}$$

If $A$ is of full rank, then the pair $(A, B)$ defines a critical point of $E$ if and
only if $A$ satisfies Equation (28) and $B$ satisfies Equation (15). Hence $B$ must
also be of full rank.

Proof. If $(A, B)$ is a critical point of $E$, then from Equation (14) we must
have

$$A^* (AB - \Sigma_{YX} \Sigma_{XX}^{-1}) = 0 \tag{31}$$

Let

$$S = AB - P_A \Sigma_{YX} \Sigma_{XX}^{-1} \tag{32}$$

Then since $A^* P_A = A^*$, we have $A^* S = 0$. Thus the columns of $S$ are
orthogonal to the space spanned by the columns of $A$. On the
other hand, since

$$P_A S = S \tag{33}$$

the columns of $S$ also lie in the space spanned by the columns of $A$, so
$A^* S = 0$ implies $S = 0$. This proves Equation (27). Note that for this
result, we only need that $B$ is critical (i.e. optimized with respect to $A$).
Using the fact that $S = 0$, we have

$$P_A \Sigma P_A = P_A \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XX} \Sigma_{XX}^{-1} \Sigma_{XY} P_A = A B \Sigma_{XX} B^* A^* \tag{34}$$

Similarly, we have

$$P_A \Sigma = A B \Sigma_{XY} \tag{35}$$

and

$$\Sigma P_A = \Sigma_{YX} B^* A^* \tag{36}$$

Then Equation (28) results immediately using Equation (22). The rest of the
theorem follows easily.

Remark 2. The above proof unifies the cases when AB is of rank p and less 
than p and avoids the need for two separate proofs, as was done in earlier 
work [i] for the real-valued case. 

Theorem 4. (Critical Points of Full Rank) Assume that $\Sigma$ is of full
rank with $n$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_n$ and let $u_1, \ldots, u_n$ denote a
corresponding basis of orthonormal eigenvectors. If $\mathcal{I} = \{i_1, \ldots, i_p\}$
($1 \leq i_1 < \ldots < i_p \leq n$) is any ordered set of indices of size $p$, let
$U_\mathcal{I} = (u_{i_1}, \ldots, u_{i_p})$
denote the matrix formed using the corresponding column eigenvectors. Then
two full rank matrices $A$, $B$ define a critical point of $E$ if and only if there
exists an ordered $p$-index set $\mathcal{I}$ and an invertible $p \times p$ matrix $C$ such that

$$A = U_\mathcal{I} C \quad \text{and} \quad B = C^{-1} U_\mathcal{I}^* \Sigma_{YX} \Sigma_{XX}^{-1} \tag{37}$$

For such a critical point, we have

$$W = AB = P_{U_\mathcal{I}} \Sigma_{YX} \Sigma_{XX}^{-1} \tag{38}$$

and

$$E(A, B) = \mathrm{Tr}\, \Sigma_{YY} - \sum_{i \in \mathcal{I}} \lambda_i \tag{39}$$

In the auto-associative case, these equations reduce to

$$A = U_\mathcal{I} C \quad \text{and} \quad B = C^{-1} U_\mathcal{I}^* \tag{40}$$

$$W = AB = P_{U_\mathcal{I}} \tag{41}$$

and

$$E(A, B) = \mathrm{Tr}\, \Sigma - \sum_{i \in \mathcal{I}} \lambda_i = \sum_{i \in \bar{\mathcal{I}}} \lambda_i \tag{42}$$

where $\bar{\mathcal{I}} = \{1, \ldots, n\} \setminus \mathcal{I}$ is the complement of $\mathcal{I}$.
Proof. Since $P_A \Sigma = \Sigma P_A$, we have

$$P_A \Sigma A = \Sigma P_A A = \Sigma A \tag{43}$$

Thus the columns of $A$ form an invariant space of $\Sigma$. Thus $A$ is of the form
$U_\mathcal{I} C$. The conclusion for $B$ follows from Equation (27) and the rest is easily
deduced, as in the real case. Equation (42) can be derived easily by using
the remarks in Section 3 and using the unitary change of coordinates under
which $\Sigma_{XX}$ becomes a diagonal matrix. In this system of coordinates, we
have

$$E(A, B) = \sum_t \|y_t\|^2 + \sum_t \mathrm{Tr}\big(x_t^* (AB)^* AB x_t\big) - 2 \sum_t \mathrm{Tr}\big(y_t^* AB x_t\big)$$

Therefore, using the invariance property of the trace under permutation, we
have

$$E(A, B) = \mathrm{Tr}(\Sigma) + \mathrm{Tr}\big((AB)^2 \Sigma\big) - 2\,\mathrm{Tr}(AB \Sigma)$$

Since $AB$ is a projection operator, this yields Equation (42). In the
auto-associative case with these coordinates it is easy to see that $W(x_t)$ and
$E(A, B) = \sum_t E(x_t)$ are easily computed from the values of $W(u_i)$. In
particular, $E(A, B) = \sum_{i=1}^{n} \lambda_i \|u_i - W(u_i)\|^2$. In addition, at the critical points,
we have $W(u_i) = u_i$ if $i \in \mathcal{I}$, and $W(u_i) = 0$ otherwise.
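The structure of the full-rank critical points in the auto-associative case can be illustrated numerically. The sketch below picks an arbitrary index set and an arbitrary invertible matrix $C$ (both hypothetical choices), builds $A$ and $B$ from the eigenvectors of the data covariance, and checks the two critical-point equations as well as the error formula.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m = 5, 2, 80  # illustrative sizes
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
S = X @ X.conj().T  # covariance in the auto-associative case

# eigh returns ascending eigenvalues; reorder so that lam[0] is the largest.
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

idx = [0, 2]  # arbitrary (0-based) index set of size p, chosen for illustration
C = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))

U_I = U[:, idx]
A = U_I @ C                            # A = U_I C
B = np.linalg.solve(C, U_I.conj().T)   # B = C^{-1} U_I^*

E_val = float(np.sum(np.abs(X - A @ B @ X) ** 2))
complement = [i for i in range(n) if i not in idx]
E_pred = float(lam[complement].sum())  # sum of the unused eigenvalues

Ah, Bh = A.conj().T, B.conj().T
```

Changing `idx` to `[0, 1]` would select the two largest eigenvalues and give the smallest error among all index sets of size 2.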

Remark 3. All the previous theorems are true in the hetero-associative case
with targets $y_t$. Thus they can readily be applied to address the linear
denoising autoencoder [24] over $\mathbb{R}$ or $\mathbb{C}$. The linear denoising autoencoder
is an autoencoder trained to remove noise by having to associate noisy
versions of the inputs with the correct inputs. In other words, using the current
notation, it is an autoencoder where the inputs $x_t$ are replaced by $x_t + n_t$
where $n_t$ is the noise vector and the target outputs $y_t$ are of the form $y_t = x_t$.
Thus the previous theorems can be applied using the following replacements:

$$\Sigma_{XX} \to \Sigma_{XX} + \Sigma_{NN} + \Sigma_{NX} + \Sigma_{XN}, \quad \Sigma_{XY} \to \Sigma_{XX} + \Sigma_{NX}, \quad \Sigma_{YX} \to \Sigma_{XX} + \Sigma_{XN}.$$

Further simplifications can be obtained using particular assumptions on the
noise, such as $\Sigma_{NX} = \Sigma_{XN} = 0$.

Theorem 5. (Absence of Local Minima) The global minimum of the
complex linear autoencoder is achieved by full rank matrices $A$ and $B$
associated with the index set $\{1, \ldots, p\}$ of the $p$ largest eigenvalues of $\Sigma$ with
$A = U_\mathcal{I} C$ and $B = C^{-1} U_\mathcal{I}^*$ (and where $C$ is any invertible $p \times p$ matrix).
When $C = I$, $A = B^*$. All other critical points are saddle points associated
with corresponding projections onto non-optimal sets of eigenvectors of $\Sigma$ of
size $p$ or less.

Proof. The proof is by a perturbation argument, as in the real case, showing
that for critical points that are not associated with the global minimum there is
always a direction of escape that can be derived using unused eigenvectors
associated with higher eigenvalues in order to lower the error $E$. The proof
can be made very simple by using the group invariance properties under
transformation of the coordinates by a unitary matrix. With such a
transformation, it is sufficient to study the landscape of $E$ when $\Sigma$ is a diagonal
matrix and $A = B^* = U_\mathcal{I}$.
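The escape direction used in the perturbation argument can be illustrated on a small example. The sketch below starts at the critical point built from the first and third eigenvectors (a non-optimal index set), rotates the third eigenvector slightly toward the unused second one, and observes that the error drops, while rotating toward the fourth eigenvector raises it. The rotation parametrization, the angle, and the sizes are assumptions made for illustration, not the construction used in the proof.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 4, 100  # illustrative sizes, with p = 2
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
S = X @ X.conj().T

lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]  # lam[0] is the largest eigenvalue

def E_orth(Q):
    """Error for A = Q, B = Q^* when Q has orthonormal columns."""
    return float(np.sum(np.abs(X - Q @ (Q.conj().T @ X)) ** 2))

def rotated(theta, toward):
    """Keep u_0 and rotate u_2 by angle theta toward eigenvector `toward`."""
    second = np.cos(theta) * U[:, 2] + np.sin(theta) * U[:, toward]
    return np.column_stack([U[:, 0], second])

E_saddle = E_orth(rotated(0.0, 1))  # critical point with index set {0, 2}
E_escape = E_orth(rotated(0.1, 1))  # move toward the unused u_1: error decreases
E_worse = E_orth(rotated(0.1, 3))   # move toward u_3: error increases
E_global = E_orth(U[:, :2])         # global minimum, index set {0, 1}
```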

Remark 4. At the global minimum, if $C$ is the $p \times p$ identity matrix ($C = I$),
in the auto-associative case then the activities in the hidden layer are
given by $u_1^* x, \ldots, u_p^* x$, corresponding to the coordinates of $x$ along the first $p$
eigenvectors of $\Sigma_{XX}$. These are the so-called principal components of $x$ and
the autoencoder implements a form of Principal Component Analysis (PCA),
also closely related to Singular Value Decomposition (SVD).

The theorem above shows that when $\Sigma$ is full rank, there is a special class
of critical points associated with $C = I$. In the auto-associative case, this
class is characterized by the fact that $A$ and $B$ are conjugate transposes of
each other ($A = B^*$) in the complex-valued case, or transposes of each other
($A = B^T$) in the real-valued case. This class of critical points is special for
several reasons. For instance, in the related Restricted Boltzmann Machine
Autoencoders the weights between visible and hidden units are required to be
symmetric, corresponding to $A = B^*$. More importantly, these critical points
are closely connected to Hebbian learning (see also [17], [18], [19]). In particular,
for linear real-valued autoencoders, if $A = B^*$ and $E = 0$ so that inputs are
equal to outputs, any learning rule that is symmetric with respect to the
pre- and post-synaptic activities, which is typically the case for Hebbian
rules, will modify $A$ and $B$ but preserve the property that $A = B^*$. This
remains roughly true even if $E$ is not exactly zero. Thus for linear real-valued
autoencoders, there is something special about transposition operating on $A$
and $B$ and more generally one can suspect a similar role is played by conjugate
transposition in the case of linear complex-valued autoencoders. The next
theorem and the following section on learning algorithms further clarify this
point.

Figure 2: Landscape of E.

Theorem 6. (Conjugate Transposition) Assume $\Sigma_{XX}$ is of full rank
in the auto-associative case. Consider any point $(A, B)$ where $B$ has been
optimized with respect to $A$, including all critical points. Then

$$W = AB = B^* A^* A B = B^* A^* = W^* \quad \text{and} \quad E(A, B) = E(B^*, A^*) \tag{44}$$

Furthermore, when $A$ is full rank

$$W = P_A = P_A^* = W^* \tag{45}$$

Proof. By Theorem 1, in the auto-associative case, we have

$$A^* A B = A^*$$

Thus, by taking the conjugate transpose of both sides, we have

$$B^* A^* A = A$$

It follows that

$$B^* A^* = B^* A^* A B = AB$$

which proves Equation (44). If in addition $A$ is full rank, then by Theorem 1
$W = AB = P_A$ and the rest follows immediately.

Remark 5. Note the following. Starting from a pair $(A, B)$ with $W = AB$
and where $B$ has been optimized with respect to $A$, let $A' = B^*$ and optimize
$B$ again so that $B' = (A'^* A')^{-1} A'^*$. Then we also have

$$W' = A'B' = W^* = W = P_A \quad \text{and} \quad E(A, B) = E(A', B') \tag{46}$$
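The conjugate transposition property can be checked directly. The sketch below optimizes $B$ for a random full rank $A$, then sets the new decoder to the conjugate transpose of $B$ and re-optimizes, and compares the resulting global maps. The sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 5, 2  # illustrative sizes
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))

def optimal_B(A):
    """Auto-associative optimum B = (A^* A)^{-1} A^* for full rank A."""
    Ah = A.conj().T
    return np.linalg.solve(Ah @ A, Ah)

B = optimal_B(A)
W = A @ B

# Transpose step followed by re-optimization of B.
A2 = B.conj().T
B2 = optimal_B(A2)
W2 = A2 @ B2
```

Since the global map is unchanged, the error on any data set is unchanged as well.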

6. Optimization or Learning Algorithms 

Although mathematical formulas for the global minimum solution of the
linear autoencoder have been derived, the global solution may not be
available immediately to a self-adjusting learning circuit capable of making only
small adjustments at each learning step. Small adjustments may also be
preferable in a non-stationary environment where the set X of training vec- 
tors changes with time. Thus, from a learning algorithm standpoint it is 
still useful to consider incremental optimization algorithms, such as gradient 
descent. The previous theorems suggest two kinds of operations that could 
be used in various combinations to iteratively minimize E, taking full or 
partial steps: (1) Partial minimization: fix A (resp. B) and minimize for B 
(resp. A); (2) Conjugate Transposition: fix A (resp. B), and set B = A* 
(resp. A = B*) (the latter being reserved for the auto-associative case, and 
particularly so if one is interested in converging to solutions where A and B 
are conjugate transpose of each other, i.e. where C = I). 
Theorem 7. (Alternate Minimization) Consider the algorithm where A
and B are optimized in alternation (starting from A or B), holding the other 
one fixed. This algorithm will converge to a critical point of E. Furthermore, 
if the starting value of A or B is initialized randomly, then with probability 
one the algorithm will converge to a critical point where both A and B are 
full rank. 

Proof. A direct proof of convergence is given in Appendix B. Here we give
an indirect, but perhaps more illuminating proof, by remarking that the
alternate minimization algorithm is in fact an instance of the general EM
algorithm [9] combined with a hard decision, similar to the Viterbi learning
algorithm for HMMs or the k-means clustering algorithm with hard
assignment. For this, consider that we have a probabilistic model over the data
with parameters $A$ and hidden variables $B$, or vice versa, with parameters
$B$ and hidden variables $A$. The conditional probability of the data and the
hidden variables is given by:

$$P(\mathcal{X}, \mathcal{Y}, A \mid B) = \frac{1}{Z_1} e^{-E(A, B)} \tag{47}$$

or

$$P(\mathcal{X}, \mathcal{Y}, B \mid A) = \frac{1}{Z_2} e^{-E(A, B)} \tag{48}$$

where $Z_1$ and $Z_2$ denote the proper normalizing constants (partition
functions). During the E step, we find the most probable value of the hidden
variables given the data and current value of the parameters. Since $E$ is
quadratic, the model in Equation (48) is Gaussian and the mean and mode
are identical. Thus the hard assignment of the hidden variables in the E step
corresponds to optimizing $A$ or $B$ using Theorem 1 or Theorem 2. During the
M step, the parameters are optimized given the value of the hidden variables.
Thus the M step also corresponds to optimizing $A$ or $B$ using Theorem 1 or
Theorem 2. As a result, convergence to a critical point of $E$ is ensured by
the general convergence theorem of the EM algorithm [9]. Since $A$ and $B$ are
initialized randomly, they are full rank with probability one and, by Theorems
1 and 2, they retain their full rank after each optimization step. Note that
the error $E$ is always nonnegative, strictly convex in $A$ or $B$, and decreases at each
optimization step, and thus $E$ must converge to a limit. By looking at every
other step in the algorithm, it is easy to see that $P_A$ must converge, from
which one can see that $A$ must converge, and so must $B$.
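A minimal sketch of the alternate minimization in the auto-associative case is given below. It alternates the closed-form updates for $B$ and $A$ from a random start and compares the final error with the sum of the $n - p$ smallest eigenvalues of the data covariance. The synthetic data (with a decaying spectrum so that eigenvalue gaps are large), the iteration count, and the function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, m = 6, 2, 200  # illustrative sizes

# Synthetic complex data whose coordinates have clearly separated scales.
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.1])
X = scales[:, None] * (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))
S = X @ X.conj().T

def E(A, B):
    """Auto-associative error sum_t ||x_t - A B x_t||^2."""
    return float(np.sum(np.abs(X - A @ B @ X) ** 2))

def opt_B(A):
    """B = (A^* A)^{-1} A^* (optimal B for fixed A)."""
    Ah = A.conj().T
    return np.linalg.solve(Ah @ A, Ah)

def opt_A(B):
    """A = S B^* (B S B^*)^{-1} (optimal A for fixed B)."""
    Bh = B.conj().T
    return S @ Bh @ np.linalg.inv(B @ S @ Bh)

A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
errors = []
for _ in range(200):
    B = opt_B(A)   # partial minimization over B
    A = opt_A(B)   # partial minimization over A
    errors.append(E(A, B))

# Error of the PCA solution: sum of the n - p smallest eigenvalues of S.
E_pca = float(np.linalg.eigvalsh(S)[: n - p].sum())
```

With a generic random start the iteration behaves like a subspace (power) iteration, so in this sketch the error approaches `E_pca` rather than the error of a saddle point.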
Given the importance of conjugate transposition (Theorem 6) in the
auto-associative case, one may also consider algorithms where the operations of
conjugate transposition and partial optimization of $A$ and $B$ are interleaved.
This can be carried out in many ways. Let $A \to B$ denote that $B$ is obtained
from $A$ by optimization (Equation (16)) and $A \Rightarrow B$ denote that $B$ is obtained
from $A$ by conjugate transposition ($B = A^*$), and similarly for $B \to A$
(Equation (24)) and $B \Rightarrow A$ ($A = B^*$). Let also $\Leftrightarrow$ denote the operation
where both $A$ and $B$ are obtained by simultaneous conjugate transposition
from their current values. Then starting from (random) $A$ and $B$, here are
several possible algorithms:

• Algorithm 1: $B \to A \to B \to A \to B \ldots$

• Algorithm 2: $A \to B \to A \to B \to A \ldots$

• Algorithm 3: $B \to A \to B \Rightarrow A \to B \to A \to B \Rightarrow A \ldots$

• Algorithm 4: $A \to B \to A \Rightarrow B \to A \to B \to A \Rightarrow B \ldots$

• Algorithm 5: $B \to A \to B \Leftrightarrow B \to A \to B \ldots$

• Algorithm 6: $A \to B \to A \Leftrightarrow A \to B \to A \ldots$

• Algorithm 7: A < — B A < — B . . .. 

The theory presented so far allows us to understand their behavior easily
(Figure 3), considering a consecutive update of A and B as one iteration.
Algorithms 1 and 2 converge with probability one to a critical point where
A and B are full rank. Algorithm 1 may be slightly faster than Algorithm
2 at the beginning since in the first step Algorithm 1 takes into account the
data (Equation 21) whereas Algorithm 2 ignores it. Algorithms 3, 4, and
5 converge and lead to a solution where $A = B^*$ (or, equivalently, $C = I$).
Algorithms 3 and 5 take the same time and are faster than Algorithm 4.
Algorithm 2 and Algorithm 4 take the same time. Algorithm 3 requires
almost twice the number of steps of Algorithm 1. But Algorithm 4 is faster
than Algorithm 3. This is because in Algorithm 3, the step
$B \Rightarrow A \to B$ is basically like switching the matrices A and B, and
the error after the step $B \to A \to B$ is the same as the error after the
step $B \Rightarrow A \to B$.
Algorithms 6 and 7 in general will not converge. Only optimization steps
with respect to the B matrix are being carried out and therefore the data is
never considered.

Figure 3: Learning Curves for Algorithms 1-6. The results are obtained using linear
real-valued autoencoders of size 784-10-784 trained on images in the standard MNIST 
dataset for the digit "7" using 1,000 samples. Each consecutive update of both A and B 
is considered as one iteration. 
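
The alternating scheme of Algorithm 1 can be sketched numerically as follows. This is a minimal illustration in the auto-associative case, assuming NumPy, synthetic complex data, and hypothetical sizes and function names (`optimize_A`, `optimize_B`, `distortion`); it is not the code used to produce Figure 3.

```python
import numpy as np

def optimize_B(A):
    """A -> B: B = (A*A)^{-1} A*, the auto-associative optimum for fixed A."""
    Ah = A.conj().T
    return np.linalg.solve(Ah @ A, Ah)

def optimize_A(B, S):
    """B -> A: A = S B* (B S B*)^{-1}, the optimum for fixed B."""
    BS = B @ S
    # S is Hermitian, so A* = (B S B*)^{-1} B S.
    return np.linalg.solve(BS @ B.conj().T, BS).conj().T

def distortion(A, B, X):
    """E = sum_t ||x_t - A B x_t||^2, the columns of X being the x_t."""
    return float(np.sum(np.abs(X - A @ (B @ X)) ** 2))

rng = np.random.default_rng(0)
n, p, m = 8, 3, 200                      # hypothetical sizes
scales = np.linspace(3.0, 0.5, n)[:, None]
X = scales * (rng.normal(size=(n, m)) + 1j * rng.normal(size=(n, m)))
S = X @ X.conj().T                       # unnormalized covariance matrix

B = rng.normal(size=(p, n)) + 1j * rng.normal(size=(p, n))

# Algorithm 1: B -> A -> B -> A -> ...
errors = []
for _ in range(100):
    A = optimize_A(B, S)
    B = optimize_B(A)
    errors.append(distortion(A, B, X))

# Lower bound given by PCA: sum of the n - p smallest eigenvalues of S.
pca_optimum = float(np.sum(np.linalg.eigvalsh(S)[: n - p]))
print(errors[0], errors[-1], pca_optimum)
```

In this sketch a transposition step such as $B \Rightarrow A$ would simply read `A = B.conj().T`, so the other algorithms can be obtained by interleaving that assignment with the two optimization calls.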

7. Generalization Properties 

One of the most fundamental problems in machine learning is to understand
the generalization properties of a learning system. Although in general
this is not a simple problem, in the case of the autoencoder the
generalization properties can easily be understood. After learning, A and B
must be at a critical point. Assuming without much loss of generality that A is
also full rank and $\Sigma_{XX}$ is invertible, then from Theorem 1 we know in
the auto-associative case that $W = P_A$. Thus we have the following result.

Theorem 8. (Generalization Properties) Assume in the auto-associative
case that $\Sigma_{XX}$ is invertible. For any learning algorithm that converges
to a point where B is optimized with respect to A and A is full rank (including
all full rank critical points), then for any vector x we have
$Wx = ABx = P_A x$ and

$$E(x) = \|x - ABx\|^2 = \|x - P_A x\|^2 \qquad (49)$$

Remark 6. Thus the reconstruction error of any vector is equal to the square
of its distance to the subspace spanned by the columns of A, or the square of
the norm of its projection onto the orthogonal subspace. The general
hetero-associative case can also be treated using Theorem 1. In this case, under
the same assumptions, we have: $W = P_A \Sigma_{YX} \Sigma_{XX}^{-1}$.
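
As a numerical illustration of Theorem 8, the following sketch checks that the reconstruction error of an arbitrary vector equals its squared distance to the column space of A. It assumes NumPy; the sizes, the random matrix A, and the variable names are hypothetical choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2                                # hypothetical sizes
A = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))

# B optimized with respect to A in the auto-associative case.
B = np.linalg.solve(A.conj().T @ A, A.conj().T)
W = A @ B

# Orthogonal projector onto the column space of A, built independently
# from an orthonormal basis Q obtained by a QR factorization.
Q, _ = np.linalg.qr(A)
P_A = Q @ Q.conj().T

x = rng.normal(size=n) + 1j * rng.normal(size=n)   # any vector, seen or unseen
reconstruction_error = np.linalg.norm(x - W @ x) ** 2
squared_distance = np.linalg.norm(x - P_A @ x) ** 2
print(reconstruction_error, squared_distance)
```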

8. Recycling or Iteration Properties 

Likewise, for the linear auto-associative case, one can also easily understand
what happens when the outputs of the network are recycled into the
inputs after learning. In the RBM case, this is similar to alternately
sampling from the input and hidden layer. Interestingly, this provides also an
alternative characterization of the critical points. At a critical point where
W is a projection, we must have $W^2 = W$. Thus, after learning, the iterates
$W^m x$ are easy to understand: they converge, and all points become stable,
after a single cycle. If x is in the space spanned by the columns
of A we have $W^m(x) = x$ for any $m \geq 1$. If x is not in the space spanned by
the columns of A, then $W^m x = y$ for $m \geq 2$, where y is the projection of x
onto the space spanned by the columns of A ($Wx = P_A x = y$).

Theorem 9. (Generalization Properties) Assume in the auto-associative
case that $\Sigma_{XX}$ is invertible. For any learning algorithm that converges
to a point where B is optimized with respect to A and A is full rank (including
all full rank critical points), then for any vector x and any integer
$m \geq 1$, we have

$$W^m(x) = P_A^m(x) = P_A(x) \qquad (50)$$

Remark 7. There is a partial converse to this result, in the following sense.
Assume that W is a projection ($W^2 = W$) and therefore ABAB = AB. If
A is of full rank, then BAB = B. Furthermore, if B is of full rank, then
$BA = I_p$ (note that $BA = I_p$ immediately implies that $W^2 = W$).
Multiplying this relation by $A^* A$ on the left and A on the right, yields
$A^* A B = A^*$ after simplification, and therefore $B = (A^* A)^{-1} A^*$.
Thus according to Theorem 1 B is critical and $W = P_A$. Note that under the
sole assumption that W is a projection, there is no reason for A to be critical,
since there is no reason for A to depend on the data and on $\Sigma_{XX}$.
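
The recycling property of Theorem 9 can be checked with a short sketch. It assumes NumPy, a randomly drawn full-rank A, and B optimized with respect to A; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 2                                 # hypothetical sizes
A = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))
B = np.linalg.solve(A.conj().T @ A, A.conj().T)
W = A @ B

x = rng.normal(size=n) + 1j * rng.normal(size=n)
y = W @ x                                   # first pass through the network

# Outputs recycled into the inputs m times, for m = 1, ..., 5.
iterates = [np.linalg.matrix_power(W, m) @ x for m in range(1, 6)]

# A vector already in the column space of A is left unchanged by W.
in_span = A @ np.array([1.0, -2.0j])
```

Every element of `iterates` coincides with `y` up to rounding, which is the single-cycle stabilization described in this section.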

9. Deep Architectures

Figure 4: Vertical Composition of Autoencoders.

Autoencoders can be composed vertically (Figure 4), as in the deep
architecture approach described in [11, 12], where a stack of RBMs is trained in
an unsupervised way, in bottom up fashion, by using the activity in the hidden
layer of an RBM in the stack as the input for the next RBM in the stack.
Similar architectures and algorithms can be applied to linear networks. Consider
for instance training a 10/5/10 autoencoder and then using the activities 
in the hidden layer to train a 5/3/5 autoencoder. This architecture can be 
contrasted with a 10/5/3/5/10 architecture, or a 10/3/10 architecture. In 
all cases, the overall transformation W is linear and constrained in rank by 
the size of the smallest layer in the architecture. Thus all three architectures 
have the same optimal solution associated with Principal Component Analy- 
sis using the top 3 eigenvalues. However the landscapes of the error functions 
and the learning trajectories may be different and other considerations may 
play a role in the choice of an architecture. 

In any case, the theory developed here can be adapted to multi-layer real- 
valued or complex-valued linear networks. Overall, such networks implement 
a linear transformation with a rank restriction associated with the smallest 
hidden layer. As in the single hidden layer case, the overall distortion is
convex in any single matrix while all the other matrices are held fixed. Any
algorithm that successively, or randomly, optimizes each matrix with respect
to all the others will converge to a critical point, which will be full rank
with probability one if the matrices are initialized randomly. For instance,
to be more precise, consider a network with five stages associated with the
five matrices A, B, C, D and E of the proper sizes and the error function

$$E(A, B, C, D, E) = \sum_t \|y_t - ABCDE\,x_t\|^2.$$

Theorem 10. For any fixed set of matrices A, B, D and E, the function
E(A, B, C, D, E) is convex in the coefficients of C and attains its minimum
for any C satisfying the equation

$$B^* A^* A B C D E \Sigma_{XX} E^* D^* = B^* A^* \Sigma_{YX} E^* D^* \qquad (51)$$

If $\Sigma_{XX}$ is invertible and AB and DE are of full rank, then E is strictly
convex and has a unique minimum reached when

$$C = (B^* A^* A B)^{-1} B^* A^* \Sigma_{YX} E^* D^* (D E \Sigma_{XX} E^* D^*)^{-1} \qquad (52)$$
Proof: We write

$$E(A, B, C, D, E) = \sum_t \left( x_t^* E^* D^* C^* B^* A^* A B C D E x_t - 2\,\mathrm{Re}(y_t^* A B C D E x_t) + \|y_t\|^2 \right) \qquad (53)$$

Then for fixed A, B, D, E, E is a convex function because the associated
quadratic form is equal to

$$\sum_t x_t^* E^* D^* L^* B^* A^* A B L D E x_t = \sum_t \|A B L D E x_t\|^2 \geq 0 \qquad (54)$$

for any matrix L of the proper size. Let C be a critical point. Then
by definition for any matrix L of the proper size, we must have
$\lim_{\epsilon \to 0} [E(A, B, C + \epsilon L, D, E) - E(A, B, C, D, E)]/\epsilon = 0$.
Expanding and simplifying this expression gives

$$\sum_t \left( x_t^* E^* D^* C^* B^* A^* A B L D E x_t - y_t^* A B L D E x_t \right) = 0 \qquad (55)$$

for all matrices L of the proper size. Using the linearity of the trace
operator and its invariance under circular permutation of its arguments, this is
equivalent to

$$\mathrm{Tr}\left( (D E \Sigma_{XX} E^* D^* C^* B^* A^* A B - D E \Sigma_{XY} A B) L \right) = 0 \qquad (56)$$

for any L. Thus we have
$D E \Sigma_{XX} E^* D^* C^* B^* A^* A B - D E \Sigma_{XY} A B = 0$ and
therefore

$$B^* A^* A B C D E \Sigma_{XX} E^* D^* = B^* A^* \Sigma_{YX} E^* D^* \qquad (57)$$

Finally, if $\Sigma_{XX}$ is invertible and AB and DE are of full rank, then
$A B L D E x_t = 0$ for any t is equivalent to L = 0, and thus the function
E(A, B, C, D, E) is strictly convex in C. Thus in this case we can solve
Equation 57 for C to get Equation 52.
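
The closed form of Theorem 10 can be checked numerically. The sketch below, assuming NumPy, random complex data, and hypothetical layer sizes, evaluates Equation 52, measures the residual of the stationarity condition (Equation 51), and leaves a `distortion` helper for comparing perturbed values of C. The helper names `crandn`, `H`, and `distortion` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def crandn(*shape):
    """Random complex Gaussian array (illustrative data only)."""
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

def H(M):
    """Conjugate transpose."""
    return M.conj().T

# Hypothetical layer sizes 6 -> 5 -> 4 -> 3 -> 4 -> 6 and 50 training pairs.
E = crandn(5, 6)
D = crandn(4, 5)
B = crandn(4, 3)
A = crandn(6, 4)
X = crandn(6, 50)
Y = crandn(6, 50)

Sxx, Syx = X @ H(X), Y @ H(X)
AB, DE = A @ B, D @ E

def distortion(C):
    """Sum over t of ||y_t - A B C D E x_t||^2."""
    return float(np.sum(np.abs(Y - AB @ C @ DE @ X) ** 2))

# Equation 52: C = (B*A*AB)^{-1} B*A* Syx E*D* (DE Sxx E*D*)^{-1}
rhs = H(AB) @ Syx @ H(DE)
C = np.linalg.solve(H(AB) @ AB, rhs) @ np.linalg.inv(DE @ Sxx @ H(DE))

# Residual of the stationarity condition, Equation 51.
residual = H(AB) @ AB @ C @ DE @ Sxx @ H(DE) - rhs
print(np.linalg.norm(residual) / np.linalg.norm(rhs))
```

Because the distortion is strictly convex in C under the stated rank assumptions, `distortion(C + L)` should exceed `distortion(C)` for any nonzero perturbation `L` of the same shape.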

10. Conclusion 

We have provided a fairly complete and general treatment of complex- 
valued linear autoencoders. The treatment can readily be applied to special 
cases, for instance when the vectors are real-valued and the matrices are 
complex-valued, or the vectors are complex-valued and the matrices are
real-valued. More importantly, the treatment provides a unified view of
real-valued and complex-valued linear autoencoders. In the Appendix, we further
broaden the treatment of linear autoencoders over infinite fields by looking 
at their properties from a differential geometry perspective. 

More broadly, the framework used here identifies key questions and strate- 
gies that ought to be studied for any class of autoencoders, whether linear 
or non-linear. For instance: 

1. What are the relevant group actions for the problem? 

2. Can one of the transformations (A or B) be solved while the other is 
held fixed? 

3. Are there any critical points, and how can they be characterized? 

4. Is there a notion of symmetry or transposition between the transfor- 
mations A and B around critical points? 

5. Is there an overall analytical solution? Is the problem NP-hard? What
is the landscape of E?

6. What are the learning algorithms and their properties? 

7. What are the generalization properties? 

8. What happens if the outputs are recycled? 

9. What happens if autoencoders are stacked vertically?

All these questions can be raised anew for other linear autoencoders, for
instance over $\mathbb{R}$ or $\mathbb{C}$ with the $L_p$ norm ($p \neq 2$), or
over other fields, in particular over finite fields with the Hamming distance.
While results for finite fields will be published elsewhere, it is clear that
these questions have different answers in the finite field case. For instance,
the notion of analytically solving for A or B, while holding the other one
fixed, using convexity breaks down in the finite field case.

These questions can also be applied to non-linear autoencoders. While
in general non-linear autoencoders are difficult to treat analytically, the case
of Boolean autoencoders was recently solved using this framework, providing
further insights into the unity of autoencoders. Boolean autoencoders
implement a form of clustering when p < n and, in retrospect, all linear 
autoencoders implement also a form of clustering. In the linear case, for 
any vector x and any W = AB we have W(x + KerW) = W(x). KerW 
is the kernel of W which contains the kernel of B, and is equal to it when 
A is of full-rank. Thus, in general, linear autoencoders implement clustering 
"by hyperplane" associated with the kernel of B. Taken together, these facts 
point to the more general unity connecting unsupervised learning, clustering, 
Hebbian learning, and autoencoders. 
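
The clustering "by hyperplane" can be made concrete with a small sketch: any two inputs that differ by an element of the kernel of B are mapped to the same output. The example assumes NumPy and arbitrary random matrices A and B of hypothetical sizes; the kernel basis is obtained from a full singular value decomposition of B.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 2                                 # hypothetical sizes
A = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))
B = rng.normal(size=(p, n)) + 1j * rng.normal(size=(p, n))
W = A @ B

# Rows p, ..., n-1 of Vh (from the full SVD of B) are orthogonal to the
# row space of B; their conjugate transposes form a basis of Ker B.
_, _, Vh = np.linalg.svd(B)
kernel_basis = Vh[p:].conj().T              # shape (n, n - p)

x = rng.normal(size=n) + 1j * rng.normal(size=n)
shift = kernel_basis @ (rng.normal(size=n - p) + 1j * rng.normal(size=n - p))

same_cluster = np.allclose(W @ (x + shift), W @ x)
print(same_cluster)
```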

Finally, there is the case of autoencoders, linear or non-linear, with p >
n which has not been addressed here. Clearly, additional restrictions or
conditions must be imposed in this case, such as sparse encoding in the
hidden layer using L1 regularization, to avoid trivial solutions associated
with the identity function. Although beyond the scope of this paper, these
autoencoders are also of interest. For instance, the linear case over finite
fields with noise added to the hidden layer, subsumes the theory of linear
codes in coding theory [15].

Thus, in short, one can expect autoencoders to continue to play an important
role in machine learning and provide fertile connections to other areas,
from clustering to information and coding theory.



Appendix A: Differential Geometry of Autoencoders 

Methods from differential geometry have been applied effectively to statistical
machine learning in previous studies by Amari [1, 2] and others. Here
however we introduce a novel approach for looking at the manifolds of relevant
parameters for linear autoencoders over the real or complex fields.
While the basic results in this section are not difficult, they do assume some
understanding of the most basic concepts of differential geometry [22].

Let $R_p$ be the set of n × n complex matrices of rank at most equal to p.
Obviously, $AB \in R_p$. In general, $R_p$ is a singular variety (a Brill-Noether
variety). We let also $R_p \setminus R_{p-1}$ be the set of n × n matrices of
rank exactly p. As we shall see, $R_p \setminus R_{p-1}$ is a complex manifold.

Definition 1. We let

$$F_p(W) = \sum_{t=1}^{m} \|y_t - W x_t\|^2 \qquad (58)$$

where $W \in R_p$.

Let $M_{p \times q}$ be the set of all p × q complex matrices. Define the mapping

$$\iota : M_{n \times p} \times M_{p \times n} \to R_p \quad \text{with} \quad \iota(A, B) = AB \qquad (59)$$

by taking the product of the corresponding matrices. Then we have
$F \circ \iota = E$. We are going to show that $\iota$ is surjective and the
differential of $\iota$ is of full rank at any point.

Lemma 1. $R_p \setminus R_{p-1}$ is a complex manifold of dimension $2np - p^2$.

Proof. Let $W \in R_p \setminus R_{p-1}$. To construct a set of local coordinates
of $R_p \setminus R_{p-1}$ near W, we write

$$W = (w_1, \cdots, w_n) \qquad (60)$$

where $w_1, \cdots, w_n$ are column vectors. Without any loss of generality, we
assume that $w_1, \cdots, w_p$ are linearly independent. Thus we must have

$$w_j = \sum_{i=1}^{p} \xi_{ij} w_i \qquad (61)$$

for j > p, with complex coefficients $\xi_{ij}$. The local coordinates of
$R_p \setminus R_{p-1}$ are $(\xi_{ij})_{1 \leq i \leq p,\, p < j \leq n}$ and
$(w_{ik})_{1 \leq i \leq p,\, 1 \leq k \leq n}$. Thus

$$\dim(R_p \setminus R_{p-1}) = p(n - p) + pn = 2pn - p^2 \qquad (62)$$

Next, we consider the tangent space $T_W$ of $R_p \setminus R_{p-1}$ at W. By
definition, a basis of $T_W(R_p \setminus R_{p-1})$ is given by

$$\frac{\partial}{\partial w_{ik}}, \quad 1 \leq i \leq p,\ 1 \leq k \leq n; \qquad (63)$$

$$\frac{\partial}{\partial \xi_{ij}}, \quad 1 \leq i \leq p,\ p < j \leq n. \qquad (64)$$

Let $(e_1, \cdots, e_n)$ be the standard basis of $\mathbb{C}^n$. Then the
corresponding matrices of the tangent vectors are

$$\frac{\partial}{\partial w_{ik}} = (0, \cdots, \underbrace{e_k}_{i\text{-th place}}, \cdots, 0, \xi_{i,p+1} e_k, \cdots, \xi_{i,n} e_k)$$

$$\frac{\partial}{\partial \xi_{ij}} = (0, \cdots, 0, 0, \cdots, \underbrace{w_i}_{j\text{-th place}}, \cdots, 0).$$

Lemma 2. Let $W = AB$, where A, B are full-rank n × p and p × n matrices,
respectively. Let $A_1$, $B_1$ be n × p and p × n matrices such that

$$A B_1 + A_1 B = 0 \qquad (65)$$

Then there is a p × p matrix V such that

$$A_1 = AV, \quad B_1 = -VB \qquad (66)$$

Proof. By multiplying on the left by $A^*$, we have

$$A^* A B_1 + A^* A_1 B = 0. \qquad (67)$$

Since A is full rank, $A^* A$ is an invertible p × p matrix. Thus

$$B_1 = -(A^* A)^{-1} A^* A_1 B \qquad (68)$$

Substituting the above into Equation 65 yields

$$-A (A^* A)^{-1} A^* A_1 B + A_1 B = 0 \qquad (69)$$

Since B is of full rank, we get

$$-A (A^* A)^{-1} A^* A_1 + A_1 = (I - P_A) A_1 = 0 \qquad (70)$$

which implies that the columns of $A_1$ lie in the image of $P_A$, i.e. in the
space spanned by the columns of A. Hence $A_1 = AV$ for some p × p matrix V,
and Equation 68 then gives $B_1 = -(A^* A)^{-1} A^* A V B = -VB$.

Lemma 3. The tangent space $T_W(R_p \setminus R_{p-1})$ is spanned by the
matrices of the form

$$A B_1 + A_1 B \qquad (71)$$

where A and B are fixed, $AB = W$, and $A_1$, $B_1$ are n × p and p × n matrices,
respectively.

Proof. Define a linear map

$$\sigma : M_{n \times p} \times M_{p \times n} \to M_{n \times n} \quad \text{with} \quad \sigma(A_1, B_1) = A B_1 + A_1 B \qquad (72)$$

We have

$$\dim \mathrm{Im}(\sigma) = 2np - \dim \mathrm{Ker}(\sigma) \qquad (73)$$

By the above lemma, $\dim \mathrm{Ker}(\sigma) = p^2$. Thus the image of $\sigma$
has the same dimension as the manifold $R_p \setminus R_{p-1}$. Hence all the
tangent vectors must be of the form $A B_1 + A_1 B$.
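
The dimension count in Lemmas 1 to 3 can be verified numerically by writing the linear map $\sigma$ as an explicit matrix and computing its rank. The sketch assumes NumPy, random full-rank A and B of hypothetical sizes, and the column-major vectorization identity $\mathrm{vec}(PXQ) = (Q^T \otimes P)\,\mathrm{vec}(X)$; the names `vec` and `J` are introduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 2                                 # hypothetical sizes
A = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))
B = rng.normal(size=(p, n)) + 1j * rng.normal(size=(p, n))

def vec(M):
    """Stack the columns of M (column-major vectorization)."""
    return M.flatten(order="F")

# vec(A1 @ B) = (B^T kron I_n) vec(A1) and vec(A @ B1) = (I_n kron A) vec(B1),
# so J is the matrix of sigma acting on the stacked vector [vec(A1); vec(B1)].
J = np.hstack([np.kron(B.T, np.eye(n)), np.kron(np.eye(n), A)])

rank = np.linalg.matrix_rank(J)
kernel_dim = J.shape[1] - rank
print(rank, 2 * n * p - p * p, kernel_dim)

# A kernel element of the form given in Lemma 2: A1 = A V, B1 = -V B.
V = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
A1, B1 = A @ V, -V @ B
```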

Lemma 4. For any $W \in R_p$, there exist an n × p matrix A and a p × n
matrix B such that W = AB. In other words, $\iota$ is a surjective map.

Proof. We use the following singular value decomposition of matrices

$$W = U_1 \Lambda U_2, \qquad (74)$$

where $U_1$, $U_2$ are unitary matrices and $\Lambda$ is a diagonal matrix. Since
W is of rank no more than p, we can write $\Lambda$ as

$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p, 0, \ldots, 0) \qquad (75)$$

Let $(U_1)_p$ represent the first p columns of $U_1$ and let $(U_2)_p$ be the first
p rows of $U_2$. Let $\Lambda_1$ be the leading p × p block of $\Lambda$. Then

$$W = (U_1)_p \Lambda_1 (U_2)_p \qquad (76)$$

Thus the lemma is proved by letting $A = (U_1)_p \Lambda_1$ and $B = (U_2)_p$.
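
The construction in the proof of Lemma 4 translates directly into a few lines of code. This sketch assumes NumPy and a randomly generated matrix of rank at most p; sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 2                                 # hypothetical sizes
# An n x n complex matrix of rank at most p.
W = (rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))) @ (
    rng.normal(size=(p, n)) + 1j * rng.normal(size=(p, n)))

# numpy returns W = U1 @ diag(singular_values) @ U2 with U1, U2 unitary.
U1, singular_values, U2 = np.linalg.svd(W)

A = U1[:, :p] * singular_values[:p]         # (U1)_p Lambda_1, shape (n, p)
B = U2[:p, :]                               # first p rows of U2, shape (p, n)
print(np.linalg.norm(W - A @ B))
```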

In general, $R_p$ is not a manifold. One of the resolutions $\tilde{R}_p$ of
$R_p$ is defined as follows

$$\tilde{R}_p = \{(A, V) \mid A \in R_p,\ V \subset \ker A^*,\ \dim V = n - p\}$$

In this case, $\tilde{R}_p$ is a manifold and we can extend the function F to
$\tilde{R}_p$ in a natural way: for $(A, V) \in \tilde{R}_p$, we let
$\tilde{F}_p(A, V) = F_p(A)$.

By the convexity of the quadratic function $\|y_t - A x_t\|^2$, we get the
following conclusion

Theorem 11. Both $F_p$, $\tilde{F}_p$ are convex functions on
$R_p \setminus R_{p-1}$. In particular, all critical points of the functions are
global minima.

Remark 8. By the relation $E = F \circ \iota$, we have

$$D^2 E = D^2 F(\nabla \iota, \nabla \iota) + \nabla F \circ D^2 \iota$$

The first term on the right-hand side is always nonnegative by the convexity
of $F_p$. However the second term can be positive or negative, which partly
explains why E is not convex and has many critical points that are saddle
points.

We end this section with the following result

Theorem 12. Let

$$E(A_1, \cdots, A_k) = \sum_t \|y_t - A_1 \cdots A_k x_t\|^2,$$

where $A_i$ are $(\mu_i, \delta_i)$ matrices. Let

$$a = \min(\mu_i, \delta_i)$$

Then

$$E(A_1, \cdots, A_k) = F_a(A_1 \cdots A_k)$$

Proof. The only non-trivial point is that any rank a matrix can be decomposed
into a product of the form $A_1 \cdots A_k$, where $A_j$ is a
$\mu_j \times \delta_j$ matrix. For k = 2, this is just Lemma 4. For k > 2, the
statement can be proved using mathematical induction.

Appendix B: Direct Proof of Convergence for the Alternate Minimization Algorithm

It is expected that starting from any full rank initial matrices $(A_1, B_1)$,
if we inductively define

$$A_{k+1} = \Sigma_{YX} B_k^* (B_k \Sigma_{XX} B_k^*)^{-1}$$

$$B_{k+1} = (A_{k+1}^* A_{k+1})^{-1} A_{k+1}^* \Sigma_{YX} \Sigma_{XX}^{-1}$$

then $(A_k, B_k)$ should converge to a critical point of E. In this section, we
prove the following

Theorem 13. In the auto-associative case, assume that

$$\sum_{j \in X} \lambda_j$$

are different for different sets X, where X is defined in Theorem 7. Then
$(A_k, B_k)$ converges to a critical point of E(A, B).

Proof. In the auto-associative case, the algorithm becomes

$$A_{k+1} = \Sigma B_k^* (B_k \Sigma B_k^*)^{-1}$$

$$B_{k+1} = (A_{k+1}^* A_{k+1})^{-1} A_{k+1}^*.$$

Therefore, we obtain

$$B_{k+1} = (B_k \Sigma B_k^*)(B_k \Sigma^2 B_k^*)^{-1} B_k \Sigma \qquad (78)$$

for $k \in \mathbb{N}$. We let $V_k$ be the vector space spanned by the rows of
$B_k \Sigma$. Then in fact, we have

$$B_{k+1} = B_k P_{V_k} \qquad (79)$$

where $P_{V_k}$ is the orthogonal projection onto the subspace $V_k$. Define

$$C_k = (B_k \Sigma B_k^*)(B_k \Sigma^2 B_k^*)^{-1} B_k \Sigma = B_{k+1},$$

$$D_k = B_k - (B_k \Sigma B_k^*)(B_k \Sigma^2 B_k^*)^{-1} B_k \Sigma = B_k - B_{k+1}$$

Then $B_k = C_k + D_k$ is the orthogonal decomposition of $B_k$ with respect to
$V_k$.
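
Equations 78 and 79 and the orthogonality of the decomposition can be checked on random data. The sketch assumes NumPy, a randomly generated Hermitian positive definite matrix standing in for $\Sigma$, and hypothetical sizes; the projector onto $V_k$ is formed with a pseudo-inverse and acts on row vectors from the right.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 6, 2                                 # hypothetical sizes

def H(M):
    """Conjugate transpose."""
    return M.conj().T

G = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Sigma = G @ H(G) + np.eye(n)                # Hermitian positive definite
B_k = rng.normal(size=(p, n)) + 1j * rng.normal(size=(p, n))

# One step of the alternate minimization in the auto-associative case.
A_next = Sigma @ H(B_k) @ np.linalg.inv(B_k @ Sigma @ H(B_k))
B_next = np.linalg.solve(H(A_next) @ A_next, H(A_next))

# Closed form of Equation 78; B Sigma^2 B* equals (B Sigma)(B Sigma)*.
BS = B_k @ Sigma
B_closed = (BS @ H(B_k)) @ np.linalg.solve(BS @ H(BS), BS)

# Orthogonal projector onto V_k, the row space of B_k Sigma.
P_V = np.linalg.pinv(BS) @ BS
C_k = B_k @ P_V
D_k = B_k @ (np.eye(n) - P_V)

# Hilbert-Schmidt inner product of C_k and D_k (zero if orthogonal).
inner = np.trace(C_k @ H(D_k))
print(abs(inner), np.linalg.norm(B_next), np.linalg.norm(B_k))
```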

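In what follows, we use the Hilbert-Schmidt norm of a matrix:

$$\|A\| = \sqrt{\mathrm{Tr}(A^* A)} \qquad (80)$$

and study two different cases, depending on whether $\|D_k\| / \|B_k\|$ converges
to 0 or not.

Case 1. $\|D_k\| / \|B_k\|$ does not converge to 0.

In this case, there is an $\varepsilon_0 > 0$ and a subsequence $k_j$ of
$\mathbb{N}$ such that

$$\|D_{k_j}\| / \|B_{k_j}\| \geq \varepsilon_0 \qquad (81)$$

Since $B_k = C_k + D_k$ is an orthogonal decomposition, we get

$$\|C_{k_j}\| \leq (1 - \varepsilon_0^2)^{1/2} \|B_{k_j}\| \qquad (82)$$

Thus, for every index $k_j$ of the subsequence, we have

$$\|B_{k_j+1}\| = \|B_{k_j} P_{V_{k_j}}\| = \|C_{k_j} P_{V_{k_j}}\| \leq \|C_{k_j}\| \leq (1 - \varepsilon_0^2)^{1/2} \|B_{k_j}\| \qquad (83)$$

Since we always have

$$\|B_{i+1}\| \leq \|B_i\| \qquad (84)$$

then if $k > k_j$, we have

$$\|B_{k+1}\| \leq (1 - \varepsilon_0^2)^{j/2} \|B_1\| \qquad (85)$$

and thus $B_k \to 0$.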

Case 2. $\|D_k\| / \|B_k\| \to 0$ as $k \to \infty$. In this case, since

$$\|B_{k+1}\| \leq \|B_k\| \qquad (86)$$

$D_k \to 0$ and hence the limiting points B of $\{B_k\}$ must satisfy

$$B - (B \Sigma B^*)(B \Sigma^2 B^*)^{-1} B \Sigma = B - B P_V = 0 \qquad (87)$$

where V is the linear space spanned by the rows of $B\Sigma$. It follows that the
row space of B is invariant under $\Sigma$. Therefore, there is a p × p matrix C
and a subset X such that $B = U_X C$. On the other hand, by the definition
of $A_k$, $B_k$, Theorem 1 and Theorem 2, $E(A_k, B_k)$ is a decreasing sequence.
Since $\sum_{j \in X} \lambda_j$ are all different, it is not possible for the
sequence $B_k$ to have more than one limit point by Equation S2J

In order to prove that $A_k$ is convergent as well, we first observe that in
fact Case 1 cannot happen. By the recurrence relations, we have

$$(B_k \Sigma B_k^*)^{-1} = (A_k^* A_k)(A_k^* \Sigma A_k)^{-1}(A_k^* A_k) \qquad (88)$$

Thus

$$A_{k+1} = \Sigma A_k (A_k^* \Sigma A_k)^{-1} (A_k^* A_k) \qquad (89)$$

If we write $\Sigma = \Sigma_1 \Sigma_1$ for a Hermitian matrix $\Sigma_1$, then
we have

$$\Sigma_1^{-1} A_{k+1} = \Sigma_1 A_k (A_k^* \Sigma A_k)^{-1} A_k^* \Sigma_1 \Sigma_1^{-1} A_k = P \Sigma_1^{-1} A_k \qquad (90)$$

where P is the projection onto the space spanned by the columns of
$\Sigma_1 A_k$. It follows that

$$\|\Sigma_1^{-1} A_{k+1}\| \leq \|\Sigma_1^{-1} A_k\| \qquad (91)$$

under the operator norm. Since $B_k A_{k+1} = I_p$, $B_k$ will not converge to 0.
Moreover, the limit of $B_k$ must be of full rank, since $\|A_k\|$ is bounded.
Under Case 1, $B_k \to 0$ which yields a contradiction.
Since the limit (A, B) satisfies the equations

$$A = \Sigma B^* (B \Sigma B^*)^{-1}, \quad B = (A^* A)^{-1} A^*$$

by Theorem 1 and Theorem 2, (A, B) must be a critical point of E.

Examples. The convergence can be better seen when p = 1. Let



$$\Sigma = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} \qquad (92)$$

with $\lambda_1 > \cdots > \lambda_n$. If p = 1, then there is a sequence $c_k$ of
real numbers such that

$$B_{k+1} = c_k B_1 \Sigma^k \qquad (93)$$

Let $B_1 = (b_1, \cdots, b_n)$ and let i be the smallest index such that
$b_i \neq 0$. Then

$$B_{k+1} = c_k (\lambda_1^k b_1, \cdots, \lambda_n^k b_n)$$

By Equation 79, $\|B_{k+1}\| \leq \|B_k\|$. Thus the sequence $c_k \lambda_i^k$ is
bounded for $k \to \infty$. It follows that for any j > i,
$c_k \lambda_j^k b_j \to 0$ as $k \to \infty$. Therefore $B_k \to c\,e_i$ for some
constant c by using Equation 79 again (where $e_1, \cdots, e_n$ is the
standard basis of $\mathbb{C}^n$). Moreover, $c = b_i$ by a straightforward
computation.
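
The p = 1 example can be simulated directly from Equation 78. The sketch below assumes NumPy, a hypothetical diagonal $\Sigma$ with distinct eigenvalues, and a starting row vector whose first entry is zero, so that the smallest index with a nonzero entry is i = 2; it only illustrates that the iterates align with $e_i$ while their norms do not increase.

```python
import numpy as np

lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # hypothetical eigenvalues
Sigma = np.diag(lam)
# b_1 = 0, so the smallest index with a nonzero entry is i = 2.
B = np.array([[0.0, 1.0 - 0.5j, 2.0, -1.0j, 0.7]])

norms = [np.linalg.norm(B)]
for _ in range(300):
    BS = B @ Sigma
    # Equation 78 with p = 1: both bracketed factors are 1 x 1 matrices.
    B = (BS @ B.conj().T) / (BS @ BS.conj().T) * BS
    norms.append(np.linalg.norm(B))

print(np.round(np.abs(B), 6))
```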

The case of arbitrary p values can be addressed using the above example: 
let j < i and i — j + 1 = p. Let 



B l 



/0 



V 







et-i 



J 



(94) 



be a matrix of rank p with the same matrix $\Sigma$ as above. Then



e,- 



(95) 



In conclusion, for any saddle point, one can construct a sequence that
converges to it.